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Abstract 

Charmm  (Chemistry  at  Harvard  Macromolecular  Mechanics)  is  a  program  that  is  widely  used  to  model 
and  simulate  macromolecular  systems.  Charmm  has  been  parallelized  by  using  the  CHAOS  runtime 
support  library  on  distributed  memory  architectures.  This  implementation  distributes  both  data  and 
computations  over  processors.  This  data-parallel  strategy  should  make  it  possible  to  simulate  very  large 
molecules  on  large  numbers  of  processors. 

In  order  to  minimize  communication  among  processors  and  to  balance  computational  load,  a  variety 
of  partitioning  approaches  are  employed  to  distribute  the  atoms  and  computations  over  processors.  In 
this  implementation,  atoms  are  partitioned  based  on  geometrical  positions  and  computational  load  by 
using  unweighted  or  weighted  recursive  coordinate  bisection.  The  experimental  results  reveal  that 
taking  computational  load  into  account  is  essential.  The  performance  of  two  iteration  partitioning 
algorithms,  atom  decomposition  and  force  decomposition,  is  also  compared.  A  new  irregular  force 
decomposition  algorithm  is  introduced  and  implemented. 

The  CHAOS  library  is  designed  to  facilitate  parallelization  of  irregular-  applications.  This  library  (1) 
couples  partitioners  to  the  application  programs,  (2)  remaps  data  and  partitions  work  among  processors, 
and  (3)  optimizes  interprocessor  communications.  This  paper  presents  an  application  of  CHAOS  that 
can  be  used  to  support  efficient  execution  of  irregular  problems  on  distributed  memory  machines. 


‘This  work  was  sponsored  in  part  by  ARPA  (NAG-1-1485)  and  NSF  (ASC  9213821). 


1  Introduction 


Molecular  dynamics  (MD)  simulation  programs,  such  as  Charmm  [1],  Gromos  [2],  and  Amber  [3], 
are  useful  to  study  the  structural,  equilibrium,  and  dynamic  properties  of  molecules.  These  programs  are 
very  complicated  and  computationally  intensive.  Implementing  them  on  massively  parallel  processing 
systems  not  only  reduces  execution  times  but  also  allows  users  to  solve  larger  problems.  In  the  past, 
a  number  of  algorithms  have  been  designed  and  implemented  for  Mimd  architectures  [4].  However, 
certain  difficulties  arise  when  trying  to  achieve  high  performance  with  large  numbers  of  processors,  due 
to  the  irregular  computational  structure  of  MD  programs.  This  paper  presents  an  implementation  of  the 
Charmm  program  that  scales  well  on  Mimd  distributed  memory  machines.  This  program  is  parallelized 
using  a  set  of  efficient  runtime  primitives  called  the  CHAOS  runtime  support  library  [5]. 

Several  parallel  MD  algorithms  use  the  replicated  approach,  i.e.  the  entire  system's  coordinates  and 
forces  are  stored  at  each  processor  [4].  The  replicated  approach  eliminates  the  overhead  in  maintaining 
the  data  distributions,  but  it  would  require  more  memory  space.  This  prohibits  users  from  simulating 
larger  systems.  Furthermore,  the  method  is  not  scalable  for  large  numbers  of  processors  because  each 
processor  has  to  communicate  all  coordinates  and  forces  to  all  other  processors  in  order  to  make  them 
consistent  at  the  end  of  each  time  step.  In  this  paper,  spatial  domain  decomposition  algorithms  have 
been  applied  to  distribute  data  and  computations  over  processors  based  on  the  coordinates  of  atoms  in 
the  system.  This  approach  provides  better  locality,  load  balance,  and  scalability. 

Molecular  dynamics  simulates  the  local  and  global  motion  of  atoms  in  molecules.  Each  of  the  N 
atoms  in  the  simulation  is  treated  as  a  point  mass.  Simulations  involve  the  iterative  computation  of 
the  total  potential  energy,  forces,  and  coordinates  for  each  atom  in  the  system  in  each  time  step.  The 
potential  energy  includes  bonded  energy,  which  consists  of  bond  energy,  angle  energy,  and  dihedral 
angle  energy,  and  non-bonded  energy,  which  includes  van  der  Waals  and  electrostatic  potential.  The 
bonded  force  calculations  consume  only  a  small  part  of  computation  time  because  each  atom  has  few 
numbers  of  chemical  bonds  with  other  atoms,  whereas  the  non-bonded  force  calculations  consume  most 
of  MD  simulation  time  since  each  atom  interacts  with  all  other  atoms  in  the  system.  In  this  paper,  two 
approaches,  data  decomposition  and  force  decomposition  [6],  have  been  employed  to  parallelize  the 
non-bonded  force  calculations.  Although  the  regular  force  decomposition  algorithm  partitions  atoms 
evenly  over  processors,  the  non-bonded  list  is  distributed  unevenly  and  hence  severe  load  imbalance 
occurs.  This  paper  presents  an  irregular  force  decomposition  algorithm  that  maintains  good  load  balance 
and  achieves  good  performance. 

The  CHAOS  library  is  designed  to  facilitate  parallelization  of  irregular  applications  on  distributed 
memory  multiprocessor  systems.  It  is  a  superset  of  the  Parti  library  [7],  This  library  (1)  couples 
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partitioned  to  the  application  programs,  (2)  remaps  data  and  partitions  work  among  processors,  and 
(3)  optimizes  interprocessor  communications.  This  implementation  of  parallel  Charmm  presents  an 
application  of  CHAOS  that  can  be  used  to  support  efficient  execution  of  irregular  problems  on  distributed 
memory  machines. 

In  the  next  section,  a  brief  description  of  molecular  dynamics  simulations  is  given.  In  Section  3,  the 
design  philosophy  and  functionality  of  the  Maryland  CHAOS  library  is  described.  Section  4  presents  the 
parallelization  approaches  and  optimizations.  The  performance  of  the  parallel  Charmm  is  presented  in 
Section  5.  Section  6  briefly  describes  related  work.  Conclusions  are  presented  in  Section  7. 

2  Molecular  Dynamics  (Charmm) 

Charmm  is  a  program  which  calculates  empirical  energy  functions  to  model  macromolecular  systems. 
The  purpose  of  Charmm  is  to  derive  structural  and  dynamic  properties  of  molecules  using  the  first  and 
second  order  derivative  techniques  [1], 

The  computationally  intensive  part  of  Charmm  is  the  molecular  dynamics  simulation.  This  cal¬ 
culation  is  usually  performed  when  the  molecular  structure  reaches  a  local  stable  state  (low  energy) 
through  energy  minimization.  It  simulates  the  dynamic  interactions  among  all  atoms  in  the  system  for 
a  period  of  time.  For  each  time  step,  the  simulation  calculates  the  forces  between  atoms,  the  energy  of 
the  whole  structure,  and  the  movements  of  atoms  by  integrating  Newton’s  equations  of  motion 

d2Xi  _  VE(xi) 
dt2  nii 

It  then  updates  the  coordinates  of  the  atoms.  The  positions  of  the  atoms  are  fixed  during  the  energy 
calculation;  however,  they  are  updated  when  spatial  displacements  due  to  forces  are  calculated. 

The  loop  structure  of  molecular  dynamics  simulation  is  shown  in  Figure  1 .  The  physical  values 
associated  with  atoms,  such  as  velocity,  force  and  displacement,  are  accessed  using  indirection  arrays 
(IB,  JB,  IT,  JT,  KT,  JNB).  IB  and  JB  list  all  the  bonds  in  the  system;  IB(I)  and  JB(I)  are  the  atoms 
connected  by  bond  /.  IT,  JT,  and  KT  represent  all  the  angles;  IT(I),  JT(I),  and  KT(I)  are  the  atoms  that 
construct  angle  I.  JNB  is  the  array  that  stores  the  non-bonded  lists  of  all  atoms.  The  energy  calculation 
in  the  molecular  dynamics  simulation  comprises  two  types  of  interactions  -  bonded  and  non-bonded. 

Bonded  forces  exist  between  atoms  connected  by  chemical  bonds.  Charmm  calculates  four  types 
of  bonded  forces  -  bond  potential,  bond  angle  potential,  dihedral  angle  (torsion)  potential,  and  improper 
torsion.  These  forces  are  short-range,  i.e.  forces  existing  between  atoms  that  lie  close  to  each  other  in 
space.  Bonded  interactions  remain  unchanged  during  the  entire  simulation  process  because  the  chemical 
bonds  of  structures  do  not  change.  The  time  complexity  of  bonded  forces  calculations  is  linear  to  the 
number  of  atoms  because  each  atom  has  a  finite  number  of  bonds  with  other  atoms. 
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LI:  DON=l,nsteps 

Regenerate  non-bonded  list  if  required 

C  Bonded  Force  Calculations 

L2:  DO  1=1,  NBONDS 

Calculate  bond  energy  between  atoms  IB(I)  and  JB(I) 

END  DO 

L3:  DO  1=1,  NANGLES 

Calculate  angle  energy  of  atoms  IT(I),  JT(I),  and  KT(I) 

END  DO 

C  Non-Bonded  Force  Calculation 

L4:  DO  1=1,  NATOMS 

DO  J  =  INBLO(I)+l,  INBLO(I+l) 

Calculate  non-bonded  force  between  atoms  I  and  JNB(J) 

END  DO 
END  DO 

Integrate  Newton’s  Equations  and  Update  Atom  Coordinates 
END  DO 

Figure  1:  Dynamics  Simulation  Code  from  Charmm 

Non-bonded  forces  are  due  to  the  van  der  Waals  interactions  and  electrostatic  potential  between  all 
pairs  of  atoms.  The  time  complexity  of  non-bonded  forces  calculations  is  0(N 2)  because  each  atom 
interacts  with  all  other  atoms  in  the  system.  When  simulating  large  molecular  structures,  Charmm 
approximates  this  calculation  by  ignoring  all  interactions  beyond  a  certain  cutoff  range.  This  approx¬ 
imation  is  done  by  generating  a  non-bonded  list,  array  JNB,  which  contains  all  pairs  of  interactions 
within  the  cutoff  range.  The  non-bonded  list  determines  the  pattern  of  data  accesses  in  the  non-bonded 
force  calculation  loop.  The  spatial  positions  of  the  atoms  change  after  each  time  step,  consequently,  the 
same  set  of  atoms  may  not  be  included  in  subsequent  time  steps.  Flence,  a  new  non-bonded  list  must 
be  generated.  Flowever,  in  Charmm,  users  have  control  over  non-bonded  list  regeneration  frequency. 
If  users  do  not  specify  the  non-bonded  list  regeneration  frequency,  Charmm  contains  a  heuristics  test 
that  can  be  used  to  decide  when  it  is  necessary  to  regenerate  the  list. 

3  Chaos  Runtime  Library 

The  CHAOS  library  is  a  set  of  software  primitives  that  are  designed  to  efficiently  handle  irregular 
problems  on  distributed  memory  systems.  It  is  a  superset  of  the  PARTI  library  [7] .  These  primitives  have 
been  designed  to  ease  the  implementation  of  computational  problems  on  parallel  architecture  machines 
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by  relieving  users  of  low-level  machine  specific  issues.  The  design  philosophy  has  been  to  leave  the 
original  (sequential)  source  codes  essentially  unaltered,  with  the  exception  of  the  introduction  of  various 
calls  to  the  CHAOS  primitives  which  are  embedded  in  the  codes  at  the  appropriate  locations.  These 
primitives  allow  the  distribution  and  retrieval  of  data  from  the  numerous  processor  local  memories. 

In  distributed  memory  systems,  arrays  can  be  partitioned  among  local  memories  of  processors. 
These  partitioned  arrays  are  called  distributed  arrays.  Users  can  either  partition  arrays  regularly,  e.g. 
in  block  or  cyclic,  or  specify  irregular  mapping  of  arrays  to  processors.  The  distributions  of  data  arrays 
have  significant  impact  on  performance  because  they  determine  the  patterns  of  communication  between 
processors.  It  is  frequently  advantageous  to  partition  arrays  in  appropriate  irregular  distributions. 
CHAOS  provides  primitives  to  map  arrays  onto  processors  and  uses  translation  tables  to  describe  the 
distributions  of  distributed  arrays. 

CHAOS  also  provides  a  set  of  primitives  that  carry  out  optimizations  at  runtime  to  reduce  both  the 
number  of  messages  and  the  volume  of  interprocessor  communication.  The  runtime  primitives  examine 
the  data  accesses  made  by  each  processor,  calculate  what  off-processor  data  have  to  be  fetched  and 
where  the  data  will  be  stored,  and  then  generate  communication  schedules  to  specify  the  patterns  of 
communication  for  exchanging  data.  Once  the  communication  schedules  are  ready,  efficient  gather  and 
scatter  routines  can  be  used  to  exchange  data  between  processors. 

4  Parallelization 

In  distributed  memory  machines  the  data  and  the  computational  work  must  be  partitioned  among 
individual  processors.  The  criteria  for  this  partitioning  is  to  reduce  the  volume  of  interprocessor 
data  communication  and  also  to  ensure  good  load  balance.  This  corresponds  to  partitioning  data 
and  iterations,  such  as  atoms,  bonded  force  calculations,  and  non-bonded  force  calculations,  and  then 
assigning  them  to  processors.  Therefore,  the  parallelization  efforts  consist  of  four  parts: 

•  Data  Partition  -  partitioning  atoms  over  processors, 

•  Iteration  Partition  -  partitioning  bonded  and  non-bonded  force  calculations, 

•  Loop  Preprocessing  -  translating  indexes  from  global  to  local  and  generating  communication 
schedules,  and 

•  Communication  Optimization  -  reducing  both  the  number  and  volume  of  interprocessor  commu¬ 
nication  messages. 
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(a)  Block  Distribution 


(b)  Irregular  Distribution 


Figure  2:  Distributions  of  Flydrated  Carboxy-Myoglobin  (MbCO)  on  4  Processors 

4.1  Data  Partition 

The  way  in  which  the  atoms  are  numbered  frequently  does  not  have  a  useful  correspondence  to  the 
interaction  pattern  of  the  molecules.  Naive  data  distributions  such  as  BLOCK  or  Cyclic  may  result  in 
a  high  volume  of  communication  and  poor  load  balance.  In  molecular  dynamics  programs,  bonded 
interactions  occur  only  between  atoms  in  close  proximity  to  each  other  and  non-bonded  interactions 
are  excluded  beyond  a  certain  cutoff  range.  Therefore,  it  is  reasonable  to  assign  atoms  that  are  close 
to  each  other  to  the  same  processor.  Additionally,  the  amount  of  computation  associated  with  an  atom 
depends  on  the  number  of  atoms  with  which  it  interacts.  Hence,  data  partitioners  such  as  recursive 
coordinate  bisection  (Rcb)  and  recursive  inertial  bisection  (Rib),  which  use  spatial  information  as  well 
as  computational  load,  are  good  candidates  to  partition  atoms  over  processors.  Figure  2  depicts  two 
distributions  of  hydrated  carboxy-myoglobin  (MbCO)  on  4  processors:  (a)  a  BLOCK  distribution  and 
(b)  an  irregular  distribution  generated  by  a  Rcb  partitioner. 

Bonded  force  calculations  consume  about  1%  of  the  total  execution  time  of  energy  calculation, 
while  non-bonded  calculations  consume  over  90%.  Hence,  balancing  the  computational  load  due  to 
non-bonded  calculations  is  of  primary  concern.  The  computation  work  of  non-bonded  force  calculation 
for  each  atom  is  proportional  to  the  size  of  the  non-bonded  list  for  the  atom.  Therefore,  the  size  of 
the  non-bonded  list  for  each  atom  is  a  good  estimate  of  workload.  Partitioners  examine  the  position 
and  the  estimated  workload  of  each  atom  and  distribute  atoms  onto  processors  so  that  interprocessor 
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(a)  Atom  Decomposition 


(b)  Force  Decomposition 


Figure  3:  Decomposition  of  Non-Bonded  Force  Matrix  F 

communication  is  minimal  and  load  is  balanced.  Once  the  distribution  of  atoms  is  determined,  all  data 
arrays  that  are  associated  with  atoms  are  distributed  identically. 

4.2  Iteration  Partition 

Once  atoms  are  partitioned,  the  distribution  is  used  to  decide  how  loop  iterations  of  energy  calculation 
are  partitioned  over  processors.  Each  loop  index  of  bonded  force  calculations  is  assigned  to  the  processor 
that  owns  the  majority  of  data  array  elements  accessed  in  that  iteration.  If  the  choice  of  processor  is 
not  unique,  the  processor  with  the  lowest  computational  load  is  chosen  to  maintain  load  balance.  The 
partitioning  of  loop  interactions  for  bonded  force  calculations  does  not  impact  performance  much,  but 
the  partitioning  for  non-bonded  force  calculations  impacts  significantly. 

Since  non-bonded  forces  are  the  interactions  between  all  pairs  of  atoms,  the  interactions  can  be 
represented  by  an  N  X  N  force  matrix  F,  where  N  is  the  number  of  atoms  in  the  system.  Every 
non-zero  element  Fij  of  the  force  matrix  represents  the  force  on  atom  i  due  to  atom  j.  This  matrix  F  is 
skew-symmetric  due  to  Newton’s  third  law,  Fij  =  —  FJt.  Therefore,  only  the  upper  triangular  matrix  is 
required  to  calculate  the  non-bonded  forces.  Since  the  non-bonded  force  calculations  can  be  represented 
by  force  matrix  /  ',  partitioning  iterations  of  non-bonded  force  calculations  is  equivalent  to  partitioning 
force  matrix  F.  Currently,  two  different  iteration  partitioning  approaches,  atom  decomposition  and 
force  decomposition,  have  been  implemented. 
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4.2.1  Atom  Decomposition 

In  the  force  matrix  /  ',  row  i  represents  the  list  of  atoms  that  interact  with  atom  i,  i.e.  row  i  is  the 
non-bonded  list  of  atom  i.  Hence,  one  straightforward  way  to  partition  iterations  of  non-bonded  force 
calculations  is  to  assign  the  non-bonded  list  of  each  atom  to  the  processor  that  owns  the  atom.  That  is, 
each  row  Fi  of  the  force  matrix  F  is  assigned  to  the  processor  that  owns  atom  i.  This  is  equivalent  to 
distributing  the  rows  of  F  onto  processors.  Figure  3(a)  shows  the  partitioning  of  force  matrix  F.  Row 
F{  is  assigned  to  processor  p  that  owns  atom  Xt .  Processor  p  requires  the  coordinates  of  atom  X{  and 
atoms  Xt+ 1  to  Xm  to  compute  the  non-bonded  force  j, . 

Each  processor  has  N/P  atoms  where  N  is  the  number  of  atoms  and  P  is  the  number  of  processors. 
Since  the  non-bonded  forces  are  long-range  interactions,  each  atom  interacts  with  all  other  N  —  1 
atoms.  Hence,  each  process  has  to  access  close  to  N  -  N/P  off -processor  atoms,  i.e.  the  order  of 
the  communication  cost  for  each  processor  is  O(N).  Therefore,  this  row-based  decomposition  will 
cause  0(P  X  N)  off -processor  accesses.  The  communication  overhead  increases  when  the  number 
of  processors  increases.  A  better  decomposition  scheme  is  required  to  reduce  the  communication 
overheads. 

4.2.2  Force  Decomposition 

Force  decomposition  is  a  block- wise  decomposition  of  the  non-bonded  force  matrix  rather  than  a  row¬ 
wise  decomposition  as  used  in  the  atom  decomposition  scheme.  The  communication  cost  for  each 
processor  reduces  from  O(N)  to  0(N / \/P).  The  assignment  of  sub-blocks  of  force  matrix  F  to 
processors  is  depicted  in  Figure  3(b).  The  sub-block  Fap  of  F  is  assigned  to  one  processor,  say  Pap, 
where  Xa  and  X  p  are  subsets  of  atoms  in  the  system.  Processor  Pap  will  require  coordinates  of 
atoms  Xa  and  X  p  to  calculate  the  partial  non-bonded  force  fap.  The  non-bonded  force  fa  can  then  be 
calculated  by  combining  the  partial  forces  where  i  =  1,  number  of  sub-blocks  in  row  a. 

Figure  3(b)  presents  a  regular  force  decomposition  of  force  matrix  F.  Even  numbers  of  atoms  are 
assigned  to  processors.  This  regular  decomposition  will  cause  load  imbalance,  since  some  processors 
have  only  half  of  the  computations  and  some  are  idle.  In  order  to  improve  load  balance,  an  irregular 
force  decomposition  algorithm  is  implemented  in  two  steps. 

Step  1  Processors  are  divided  into  groups  and  a  spatial  partitioner  is  applied  to  distribute  atoms  irregu¬ 
larly  to  the  processor  groups  so  that  interprocessor  communication  is  minimized  and  computation 
load  is  balanced.  This  step  is  equivalent  to  applying  the  atom  decomposition  algorithm  to  partition 
force  matrix  F  over  processor  groups. 

Step  2  Each  subset  of  the  force  matrix  is  then  partitioned  evenly  over  processors  in  the  same  group. 
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Table  1:  Regular  vs.  Irregular  Force  Decomposition 


Processors 

Regular 

Irregular 

16 

32 

64 

128 

16 

32 

64 

128 

Non-Bonded 

List 

Max. 

1203882 

1003341 

704616 

473341 

420214 

210490 

106916 

54344 

Avg. 

418801 

209400 

104700 

52350 

418801 

209400 

104700 

52350 

Idle  Processors 

6 

12 

28 

56 

0 

0 

0 

0 

Table  1  compares  the  partitioning  results  of  regular  and  irregular  force  decomposition  schemes  for  the 
benchmark  case,  MbCO  +  3830  water  molecules  (total  14026  atoms),  that  will  be  used  in  Section  5. 
The  size  of  the  non-bonded  list  is  6700818  at  the  beginning  of  simulation.  Table  1  lists  the  maximum 
and  average  sizes  of  non-bonded  lists  assigned  to  processors  and  the  number  of  idle  processors. 

Figure  4  depicts  an  irregular  force  decomposition  of  matrix  F  of  16  atoms  over  four  processors.  At 
step  1,  processors  are  divided  to  two  groups,  processor  1  and  2  in  group  Pn  and  processors  3  and  4 
in  P34.  The  force  matrix  F  in  Figure  4(a)  is  partitioned  irregularly  to  two  sub-matrices,  /') 2  and  f’34. 
Then  sub-matrix  /') 2  is  assigned  to  processor  group  / J|  2 ,  and  T34  to  P34,  as  shown  in  Figure  4(b).  Step  2 
further  partitions  sub-matrices  F12  and  /■  34  and  assigns  them  to  processors.  Figure  4(c)  shows  the  final 
result  of  the  irregular  force  decomposition. 

4.3  Loop  Preprocessing 

Since  arrays  are  distributed  over  processors,  preprocessing  has  to  be  carried  out  to  calculate  what  off- 
processor  elements  need  to  be  fetched  and  where  the  elements  will  be  stored  once  they  are  received.  The 
preprocessing  ( 1 )  examines  the  indirection  arrays  of  both  bonded  and  non-bonded  force  calculations,  (2) 
translates  the  indexes  from  global  to  local,  and  (3)  generates  a  communication  schedule  for  gathering 
and  scattering  off-processor  data. 

Indirection  arrays  for  bonded  force  calculation  loops  remain  unchanged  while  the  non-bonded  list 
adapts  during  computation.  Hence,  preprocessing  for  bonded  force  calculation  loops  need  not  be 
repeated,  whereas  it  must  be  repeated  for  non-bonded  force  calculation  loops  whenever  the  non-bonded 
list  changes.  CHAOS  provides  hash  tables  and  stamps  to  handle  codes  with  adaptive  indirection  arrays. 
While  building  schedules,  each  indirection  array  is  hashed  with  a  unique  time  stamp.  The  hash  table 
is  used  to  remove  any  duplicate  off-processor  references.  When  the  non-bonded  list  is  regenerated, 
non-bonded  list  entries  in  the  hash  table  are  cleared  with  the  corresponding  stamp.  Then  the  stamp  can 
be  reused  and  the  new  non-bonded  list  entries  are  hashed  with  the  reused  stamp. 


1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16 


2  3  4  5  6  7  8  9  10  11  12  13  14  15  16 


2  3  4  5  6  7  8  9  10  11  12  13  14  15  16 


(a)  Force  Matrix  F 


(b)  Result  of  Step  1 


1  23456789  10  11 


12  13  14  15  16 


1  2  3  4  5  6  7  8  9  10  11  12 


13  14  15  16 


(c)  Result  of  Irregular  Force  Decomposition 


Figure  4:  Irregular  Force  Decomposition 

4.4  Communication  Optimization 

In  Charmm  the  same  data  are  accessed  by  several  consecutive  loops.  The  coordinates  of  atoms  are 
used  in  both  bonded  and  non-bonded  force  calculation  loops.  The  atom  coordinates  are  only  updated 
at  the  end  of  each  time  step.  Hence,  all  of  the  off -processor  coordinates  needed  for  the  calculation  can 
be  obtained  at  the  beginning  of  the  time  step.  This  makes  it  advantageous  to  perform  computations 
without  reading  the  same  data  more  than  once. 

The  CHAOS  primitives  can  be  used  to  track  and  reuse  off-processor  data  copies.  The  primitives 
use  hash  tables  to  omit  duplicate  off-processor  data  references.  In  this  implementation,  one  single 
communication  schedule  was  generated  for  the  loops  in  the  dynamic  energy  calculation  routine  based 
on  the  content  of  the  hash  table.  This  schedule  was  used  to  gather  off -processor  data  at  the  beginning 
and  scatter  the  results  at  the  end  of  each  time  step  of  energy  calculation. 
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Table  2:  Communication  Overheads 


Number  of  Processors 

16 

32 

64 

128 

Block 

_l 

- 

293.9 

309.9 

Cyclic 

324.4 

363.5 

365.5 

396.4 

Weighted  BLOCK 

425.8 

436.3 

497.5 

572.6 

Rcb 

- 

155.2 

166.8 

198.8 

Weighted  Rcb 

147.1 

159.8 

181.1 

219.2 

5  Experiments  and  Results 

This  section  describes  the  experiments  performed  and  presents  the  corresponding  results.  A  number  of 
different  experiments  were  performed  for  data  and  iteration  partitioning.  The  results  are  presented  to 
show  how  the  performance  scales  from  small  to  large  numbers  of  processors.  All  the  timing  results, 
except  the  load  balance  index,  are  presented  in  seconds. 

The  performance  of  molecular  dynamics  simulations  was  studied  with  a  benchmark  case,  MbCO 
+  3830  water  molecules  (total  14026  atoms),  on  the  Intel  iPSC/860.  It  ran  for  1000  steps  with  40 
non-bonded  list  updates.  The  cutoff  for  non-bonded  list  generation  was  14  A.  The  non-bonded  list  was 
regenerated  40  times  during  the  simulation.  The  recursive  coordinate  bisection  (Rcb)  partitioner  was 
used  to  partition  atoms. 

5.1  Performance 

In  this  section,  the  impact  on  performance  of  different  data  and  iteration  partitioning  approaches  is 
presented. 

Communication  Overheads 

Distribution  of  data  arrays  has  significant  impact  on  performance  because  it  determines  the  patterns 
of  communication  between  processors  and,  possibly,  load  balance.  Several  regular  and  irregular  data 
partitioning  methods  have  been  implemented  to  compare  the  communication  overheads.  Table  2  presents 
average  communication  times  of  different  data  partitioning  methods  from  16  to  128  processors.  Atom 
decomposition  was  used  as  the  iteration  partitioning  algorithm. 

Both  BLOCK  and  Cyclic  are  regular  distributions  and  partition  arrays  evenly  over  processors. 
BLOCK  divides  an  array  into  contiguous  chunks  of  size  N/P  and  assigns  one  block  to  each  processor, 
whereas  Cyclic  specifies  a  round-robin  division  of  an  array  and  assigns  every  Pth  element  to  the 
same  processor.  Table  2  shows  that  both  BLOCK  and  Cyclic  do  not  exploit  locality  and,  therefore, 
'Not  available  due  to  memory  limitation  of  iPSC/860 
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Table  3:  Load  Balance  Indexes 


Number  of  Processors 

16 

32 

64 

128 

Block 

- 

- 

3.13 

3.08 

Cyclic 

1.08 

1.12 

1.21 

1.28 

Weigh.  Block 

1.05 

1.07 

1.10 

1.12 

Rcb 

- 

1.66 

2.35 

2.38 

Weigh.  Rcb 

1.01 

1.04 

1.06 

1.08 

cause  higher  communication  overheads.  Weighted  BLOCK  divides  an  array  into  contiguous  chunks 
with  different  sizes  so  that  each  chunk  would  have  the  same  amount  of  computational  work.  The 
communication  cost  of  weighted  BLOCK  is  higher  than  that  of  BLOCK  distribution,  but  weighted  BLOCK 
distribution  has  better  load  balance.  Rcb  and  weighted  Rcb  are  irregular  distributions  that  partition 
atoms  based  on  coordinates.  Both  distributions  exploit  the  locality  of  data  accesses,  and  hence  have 
lower  communication  costs. 

Load  Balance 

Table  3  presents  the  load  balance  indexes  of  different  data  partitioning  algorithms.  The  load  balance 
index  (LBI)  is  calculated  as 

(max”=1  computation  time  of  processor  i)  X  ( number  of  processors  n ) 
computation  time  of  processor  i 

Both  BLOCK  and  Cyclic  distributions  do  not  consider  computation  load  of  each  element  when  dis¬ 
tributing  data  arrays  over  processors,  and  hence  both  cause  severe  load  imbalance.  Weighted  BLOCK 
divides  arrays  into  different  sizes  of  blocks  based  on  the  computation  load  of  each  element  so  that  good 
load  balance  is  obtained.  The  irregular  distribution  generated  by  weighted  Rcb  achieves  good  load 
balance  because  the  weighted  version  of  Rcb  assigns  nearly  the  same  amount  of  computation  load  to 
each  processor. 

Performance 

Table  4  and  Table  5  present  the  performance  of  Charmm  when  atom  decomposition  and  force 
decomposition  algorithms  are  applied  for  iteration  partitioning,  respectively.  The  execution  time  in¬ 
cludes  the  energy  calculation  time  and  communication  time  of  each  processor.  The  computation  time 
is  the  average  of  the  computation  time  of  dynamics  simulations  over  processors;  the  communication 
time  is  the  average  communication  time.  As  predicated  in  Section  4.2,  force  decomposition  has  lower 
communication  overheads  than  atom  decomposition.  The  results  also  show  that  Charmm  scales  well 
and  that  good  load  balance  is  maintained  up  to  128  processors. 


11 


Table  4:  Performance  of  Parallel  Charmm  (Atom  Decomposition) 


Number  of  Processors 

1 

16 

32 

64 

128 

Execution  Time 

1 

4356.0 

2293.8 

1261.4 

781.8 

Computation  Time 

74595.5 

4099.4 

2026.8 

1011.2 

507.6 

Communication  Time 

0 

147.1 

159.8 

181.1 

219.2 

Load  Balance  Index 

1.00 

1.03 

1.05 

1.06 

1.08 

Table  5:  Performance  of  Parallel  Charmm  (Force  Decomposition) 


Number  of  Processors 

1 

16 

32 

64 

128 

Execution  Time 

74595.5 

4433.6 

2296.7 

1239.0 

750.4 

Computation  Time 

74595.5 

4240.0 

2096.8 

1039.8 

518.9 

Communication  Time 

0 

134.8 

122.3 

137.5 

188.1 

Load  Balance  Index 

1.00 

1.01 

1.04 

1.06 

1.08 

5.2  Preprocessing  Overheads 

Data  and  iterations  partitioning,  remapping,  and  loop  preprocessing,  must  be  done  at  runtime.  Prepro¬ 
cessing  overheads  of  the  simulation  are  shown  in  Table  6.  The  data  partition  time  is  the  execution  time 
of  Rcb.  All  arrays  associated  with  atoms  have  to  remapped  and  redistributed  when  the  new  distribution 
of  atoms  is  known.  The  remapping  and  preprocessing  time  shows  the  overhead.  After  partitioning 
atoms,  the  non-bonded  list  is  regenerated.  This  non-bonded  list  regeneration  was  performed  because 
atoms  were  redistributed  over  processors  and  it  was  done  before  dynamics  simulation  occurred.  In 
table  6,  the  regeneration  time  is  shown  as  non-bonded  list  update  time. 

During  simulation,  non-bonded  list  was  regenerate  periodically.  When  the  non-bonded  list  was 
updated,  the  schedule  must  be  regenerated.  The  schedule  regeneration  time  in  Table  6  gives  the 
total  schedules  regeneration  time  for  40  non-bonded  list  updates.  This  overhead  was  included  in  the 
computation  times  in  Table  4  and  Table  5.  By  Comparing  these  numbers  to  those  in  Table  5,  it  can  be 
observed  that  the  preprocessing  overhead  is  relatively  small  compared  to  the  total  execution  time. 

5.3  Scalability 

The  scalability  of  the  parallel  Charmm  was  tested  on  Intel  Delta.  Table  7  shows  the  performance 
of  Charmm  from  64  to  512  processors.  The  execution  times  and  load  balance  indexes  in  the  table 
demonstrate  that  Charmm  can  achieve  high  performance  on  large  numbers  of  processors  and  maintain 
good  load  balance. 

2Estimate  (based  on  model  systems  due  to  memory  limitations  with  1  node)  [4] 
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Table  6:  Preprocessing  Overheads 


Number  of  Processors 

16 

32 

64 

128 

Data  Partition 

0.27 

0.47 

0.83 

1.63 

Non-bonded  List  Update 

7.18 

3.85 

2.16 

1.22 

Remapping  and  Preprocessing 

0.03 

0.03 

0.02 

0.02 

Schedule  Generation 

1.31 

0.80 

0.64 

0.42 

Schedule  Regeneration  (x40) 

43.51 

23.36 

13.18 

8.92 

Table  7:  Performance  on  Intel  Delta 


Number  of  Processors 

Atom  Decomposition 

Force  Decomposition 

64 

128 

256 

512 

64 

128 

256 

512 

Execution  Time 

1226.2 

712.3 

504.6 

458.3 

1187.8 

687.3 

460.7 

397.4 

Load  Balance  Index 

1.08 

1.11 

1.20 

1.48 

1.05 

1.07 

1.09 

1.15 

Table  7  also  demonstrates  that  the  irregular  force  decomposition  outperforms  the  atom  decomposition 
by  an  increasing  margin  as  the  number  of  processors  increases.  Force  decomposition  is  3  percent  faster 
than  the  atom  decomposition  on  64  processors,  whereas  force  decomposition  is  15  percent  faster  than 
atom  decomposition  on  512  processors. 

5.4  Portability 

The  CHAOS  library  has  been  ported  to  several  distributed  memory  systems,  such  as  Cray  T3D,  IBM 
SP-2,  and  Intel  Paragon,  etc.  Using  these  libraries,  the  parallel  Charmm  code  has  been  executed  on  the 
Cray  T3D,  IBM  SP-2,  and  Intel  Paragon.  The  performance  results  of  (irregular  force  decomposition) 
Charmm  on  Cray  T3D,  IBM  SP-2,  and  Intel  Paragon  are  presented  in  the  Table  8,  Table  9,  and  Table  10, 
respectively.  For  communication,  Cray  T3D  uses  the  Parallel  Virtual  Machine  (PVM)  library,  IBM  SP-2 
uses  the  Message  Passing  Library  (MPL),  and  Intel  Paragon  uses  the  Paragon  OSF/1  Message-Passing 
System  Calls.  The  computation  time  on  IBM  SP-2  is  only  about  half  of  the  computation  times  on 


Table  8:  Performance  on  Cray  T3D  (PVM) 


Number  of  Processors 

16 

32 

64 

128 

Execution  Time 

2973.1 

1572.3 

790.6 

544.5 

Computation  Time 

2882.8 

1449.5 

637.6 

318.6 

Communication  Time 

52.6 

69.6 

113.4 

204.3 

Load  Balance  Index 

1.01 

1.04 

1.06 

1.06 
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Table  9:  Performance  on  IBM  SP-2  (MPL) 


Number  of  Processors 

16 

32 

64 

128 

Execution  Time 

759.4 

413.1 

338.1 

Computation  Time 

1332.6 

674.8 

312.2 

165.3 

Communication  Time 

48.4 

51.4 

76.3 

146.9 

Load  Balance  Index 

1.02 

1.05 

1.08 

1.16 

Table  10:  Performance  on  Intel  Paragon  (Massage-Passing  System  Calls) 


Number  of  Processors 

16 

32 

64 

128 

Execution  Time 

3472.2 

1863.4 

972.5 

579.1 

Computation  Time 

3229.4 

1629.2 

778.9 

391.1 

Communication  Time 

180.3 

165.4 

154.5 

165.6 

Load  Balance  Index 

1.02 

1.04 

1.05 

1.06 

Intel  Paragon  and  Cray  T3D.  In  addition,  the  communication  costs  on  Intel  Paragon  and  Cray  T3D  are 
slightly  worse  than  that  on  IBM  SP-2  for  comparable  numbers  of  processors.  The  net  effect  is  that  IBM 
SP-2  is  substantially  faster  when  compared  to  the  execution  times  on  Intel  Paragon  and  Cray  T3D. 

CHAOS  can  also  be  ported  to  distributed  shared  memory  systems.  For  instance,  a  subset  of  the 
CHAOS  runtime  library  has  been  ported  to  Cray  T3D  by  using  the  Shmem  shared  memory  library4.  Data 
on  remote  processors  can  be  referenced  directly  via  support  circuitry  on  each  node  and  interconnect 
network.  The  latency  of  each  remote  access  is  1  to  2  microseconds.  The  performance  of  CFIARMM  on 
Cray  T3D  using  Shmem  library  is  shown  in  Table  1 1 .  The  low  latency  of  remote  access  makes  Cray  T3D 
scale  very  well.  The  communication  overhead  goes  down  when  the  number  of  processors  increases, 
since  each  processor  accesses  fewer  remote  data.  Furthermore,  the  communication  overhead  on  (shared- 
memory)  Cray  T3D  is  dramatically  small  when  compared  to  the  overheads  on  message -passing  systems, 
which  usually  have  large  message  startup  latencies. 


’Timings  were  measured  on  the  SP-2  with  Powerl  nodes  (located  at  Argonne  National  Laboratory) 

4A  shared  memory  access  library  released  by  Cray  Research,  Inc.  as  part  of  the  CRAFT  compiler  environment 
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Table  1 1 :  Performance  on  Cray  T3D  (SHMEM  Library) 


Number  of  Processors 

16 

32 

64 

128 

Execution  Time 

2647.7 

1357.8 

698.4 

353.0 

Computation  Time 

2591.1 

1297.2 

647.0 

321.9 

Communication  Time 

21.0 

14.3 

10.9 

OO 

O 6 

Load  Balance  Index 

1.01 

1.04 

1.06 

1.07 

6  Related  Work 

There  has  been  considerable  effort  by  researchers  to  exploit  parallelism  of  molecular  dynamics  computa¬ 
tions  on  various  parallel  architectures.  Several  parallel  algorithms  have  been  proposed  and  implemented 
on  Mimd  systems  [4].  It  has  also  been  observed  that  spatial  domain  decomposition  algorithms  can  be 
applied  to  achieve  good  load  balance  [8].  More  recently  there  have  been  scalable  algorithms  created 
for  large  Mimd  multiprocessor  systems  [6.  9]. 

Several  widely  used  MD  programs  have  been  parallelized  on  Mimd  computers.  Clark  et  al.  [9] 
applied  spatial  decomposition  algorithms  to  develop  EulerGromos,  a  parallelization  of  the  GROMOS 
MD  program.  Brooks  and  Hodoscek  implemented  a  parallel  version  of  Charmm  using  the  replicated 
data  approach  [4].  The  parallelization  effort  of  Charmm  done  by  Das  and  Saltz  [10]  was  based  on  the 
data-distributed  approach.  All  data  arrays  associated  with  atoms  were  partitioned  over  processors  in 
BLOCK  distribution.  This  implementation  did  not  achieve  good  performance  and  scalability  because 
BLOCK  distribution  did  not  exploit  locality. 

Plimpton  proposed  the  regular  force  decomposition  algorithm  for  parallel  short-range  MD  compu¬ 
tation  [6].  The  algorithm  was  adapted  in  this  paper  to  decompose  force  matrices  irregularly  in  order  to 
achieve  good  load  balance. 

CHAOS  has  been  used  to  parallelize  irregular  scientific  computation  application  programs,  such  as 
computational  chemistry  and  computational  fluid  dynamics  (CFD).  CHAOS  has  also  been  used  by  several 
parallel  compilers  to  handle  irregular  computations  on  distributed  memory  systems  [11,  12]. 

7  Conclusions 

This  paper  presented  an  implementation  of  the  Charmm  program  that  scaled  well  on  Mimd  distributed 
memory  machines.  This  implementation  was  done  by  using  a  set  of  efficient  runtime  primitives  called 
the  CHAOS  runtime  support  library.  Both  the  data  decomposition  and  force  decomposition  algorithms 
were  implemented  and  compared. 

Although  the  original  force  decomposition  algorithm  partitioned  atoms  regularly  over  processors, 
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the  non-bonded  list  was  distributed  unevenly  and  hence  severe  load  imbalance  occurred.  An  irregular 
force  decomposition  algorithm  was  introduced  and  implemented  in  this  paper.  Processors  were  divided 
into  groups  and  the  atom  decomposition  algorithm  was  applied  to  distributed  force  matrix  F  onto 
processor  groups.  Each  subset  of  the  force  matrix  was  then  partitioned  evenly  over  processors  in  the 
same  group.  This  paper  showed  that  the  irregular  force  decomposition  algorithm  could  achieve  good 
performance  and  load  balance. 

The  performance  results  showed  that  the  irregular  force  decomposition  algorithm  had  lower  com¬ 
munication  overheads  and  better  load  balance  on  large  numbers  of  processors.  The  irregular  force 
decomposition  outperformed  the  atom  decomposition  by  an  increasing  margin  as  the  number  of  proces¬ 
sors  increased.  Force  decomposition  is  3  percent  faster  than  the  atom  decomposition  on  64  processors, 
whereas  force  decomposition  is  15  percent  faster  than  atom  decomposition  on  512  processors.  The 
load  balance  indexes  of  the  irregular  force  decomposition  and  atom  decomposition  algorithms  on  512 
processors  were  1.15  and  1.48,  respectively. 

Datapartitioning  is  important  for  parallelization  because  it  determines  the  patterns  of  communication 
among  processors  and  load  balance.  Appropriate  data  partitioning  schemes  can  exploit  the  data  locality 
and  reduce  interprocessor  communication  overheads. 

The  CHAOS  library  provides  a  set  of  runtime  primitives  that  (1)  couples  parti tioners  to  the  application 
programs,  (2)  remaps  data  and  partitions  work  among  processors,  and  (3)  optimizes  interprocessor 
communications.  The  runtime  primitives  can  efficiently  handle  both  regular  and  irregular  distributions 
of  data  and  iterations.  The  implementation  of  the  atom  decomposition  and  irregular  force  decomposition 
algorithms  was  straightforward  -  simply  embedding  CHAOS  runtime  primitives  in  the  original  code  at 
the  appropriate  locations.  This  implementation  of  parallel  Charmm  presented  an  application  of  CHAOS 
that  can  be  used  to  support  efficient  execution  of  irregular  problems  on  distributed  memory  machines. 
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